Homework no. 2 - 🕸 Network analysis of PID¶
📚 Data¶
Our dataset comes from the open data of Prague Integrated Transport (PID). Specifically, we will work with timetable data, originally published in the GTFS (General Transit Feed Specification) format. GTFS is supported by a wide range of software applications, which is why public transit agencies, including PID, use it when publishing their data.
☝️ Note! For this homework we provide an already prepared dataset $D$ in csv format, which we assembled for you from the timetable data. More information about all the files and their attributes can be found in the GTFS format documentation.
Assignment¶
☝️ Use markdown cells! Justify all important steps and describe your visualizations and what can be observed in them. No points will be awarded for an unreadable homework.
Up to 4 points can be earned for proper ongoing commentary and visual presentation of your approach and results. You solve the homework individually.
✨ Dataset
Load the data from the file
d.csv, which is the aforementioned dataset $D$ containing one record for every pair of consecutive stops of some trip. The structure is as follows (for simplicity we ignore the service start_date and end_date):

| stop_from | stop_from_name | stop_to | stop_to_name | depart_from | arrive_to | route_type | is_night | mon | tue | wed | thu | fri | sat | sun |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| U699Z3P | Stadion Strahov | U981Z1P | Koleje Strahov | 7:24:00 | 7:25:00 | 3 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |

You may treat the stop name as the unique identifier of a stop. Beware of identical stop names used for different means of transport - for such stops you may assume they are one and the same station (in many cases this actually holds).
⚙️ Data preprocessing
- You will most likely want to adjust the attributes `depart_from` and `arrive_to`, because they contain time values beyond 24 hours (1.5 % of the records in total). From the GTFS reference (describing the column from which `depart_from` and `arrive_to` were derived): "Arrival time at a specific stop for a specific trip on a route. … For times occurring after midnight on the service day, enter the time as a value greater than 24:00:00 in HH:MM:SS local time for the day on which the trip schedule begins." You will, however, also encounter invalid times starting with a value greater than 24. All of these cases can be handled with modulo 24 ☝️.
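A minimal sketch of that modulo-24 normalization (the helper name `normalize_gtfs_time` is ours, not part of the assignment):

```python
def normalize_gtfs_time(t: str) -> str:
    """Wrap a GTFS HH:MM:SS time that may exceed 24:00:00 back into 0-23 hours."""
    h, m, s = t.split(":")
    return f"{int(h) % 24}:{m}:{s}"

print(normalize_gtfs_time("25:10:00"))  # -> 1:10:00
```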
🕸️ Basic network analysis (12 points in total)
The task is to analyze the network of PID stops. Stops will be the nodes of the network. There is a directed edge between two stops if they are two consecutive stops of some trip (i.e. at least one record in dataset $D$ has the corresponding stop_from, stop_to). The weight of an edge equals the number of vehicles that travel that segment over the period of one week.
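As a sketch of how such a weekly weight could be derived from the day flags, assuming a toy frame with columns named as in $D$ (each record contributes one vehicle per active day):

```python
import pandas as pd

toy = pd.DataFrame({
    "stop_from": ["A", "A", "B"],
    "stop_to":   ["B", "B", "C"],
    # one record runs Mon-Fri, the other two only on weekends
    "mon": [1, 0, 0], "tue": [1, 0, 0], "wed": [1, 0, 0], "thu": [1, 0, 0],
    "fri": [1, 0, 0], "sat": [0, 1, 1], "sun": [0, 1, 1],
})
days = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
toy["weekly"] = toy[days].sum(axis=1)          # vehicles per week for one record
weights = toy.groupby(["stop_from", "stop_to"])["weekly"].sum()
print(weights.loc[("A", "B")])  # 5 weekday + 2 weekend runs = 7
```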
Proceed as follows:
- start by choosing any package for network analysis and visualization (you may use the aforementioned NetworkX),
- from dataset $D$ create a data representation suitable for your chosen visualization package,
- create a visualization of the network (4 points in total) - the visualization must be readable, so you may visualize only a part of the network (e.g. for a particular means of transport, such as trams - except a visualization of the funicular or metro network alone, which we do not accept - or some subgraph; the choice is up to you),
- using at least three centrality measures, analyze the importance of stops over the period of one week (Monday - Sunday) and comment in words on what these measures mean with respect to the actual data (2 points per measure, 6 points in total),
- create a visualization for at least one centrality measure (2 points in total).
❓ Your own questions (3 points per question, max. 9 points in total)
Come up with 3 questions (more are welcome) based on filtering dataset $D$ and answer them with suitable visualizations. Some questions for inspiration:
- Do the important stops change between daytime and nighttime service?
- Is there a difference in stop utilization between the working week and the weekend?
- ...
🔥 Extra data
The file stops.txt lists the latitude and longitude of every stop. You can use these data to extend your analysis, and they may also help with laying out the graph. ☝️ Beware of identical stop names with slightly different locations for different means of transport. You need to devise some resolution (e.g. take the first one, the mean, the centroid of the polygon, etc., at your own discretion) and justify it.
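For instance, collapsing same-named stops to the mean of their coordinates could look like this (toy data; column names follow stops.txt):

```python
import pandas as pd

stops = pd.DataFrame({
    "stop_name": ["Andel", "Andel", "Muzeum"],
    "stop_lat": [50.0710, 50.0712, 50.0800],
    "stop_lon": [14.4032, 14.4036, 14.4300],
})
# one station per name, placed at the mean of its platforms' coordinates
stations = stops.groupby("stop_name")[["stop_lat", "stop_lon"]].mean()
```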
Scoring¶
Summary of the maximum points you can earn:
- 4 points for ongoing commentary and visual presentation of the approach and results,
- 4 points for the network visualization,
- 6 points for a commented analysis of at least 3 centrality measures,
- 2 points for a visualization of one of the centrality measures,
- 9 points for defining and answering at least three of your own questions.
In total, homework no. 2 is worth a maximum of 25 points.
pip install plotly networkx pandas matplotlib scipy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx
import scipy.stats as stats
import plotly.graph_objects as go
import plotly.express as px
d = pd.read_csv('d.csv', dtype={
'monday':bool, 'tuesday':bool, 'wednesday':bool, 'thursday':bool, 'friday':bool, 'saturday':bool, 'sunday':bool, 'is_night':bool,
'route_type':'uint8'
})
d.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1642433 entries, 0 to 1642432
Data columns (total 15 columns):
 #   Column          Non-Null Count    Dtype
---  ------          --------------    -----
 0   stop_from       1642433 non-null  object
 1   stop_from_name  1642433 non-null  object
 2   stop_to         1642433 non-null  object
 3   stop_to_name    1642433 non-null  object
 4   depart_from     1642433 non-null  object
 5   arrive_to       1642433 non-null  object
 6   route_type      1642433 non-null  uint8
 7   is_night        1642433 non-null  bool
 8   monday          1642433 non-null  bool
 9   tuesday         1642433 non-null  bool
 10  wednesday       1642433 non-null  bool
 11  thursday        1642433 non-null  bool
 12  friday          1642433 non-null  bool
 13  saturday        1642433 non-null  bool
 14  sunday          1642433 non-null  bool
dtypes: bool(8), object(6), uint8(1)
memory usage: 89.3+ MB
stops = pd.read_csv('stops.txt')
stops.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16435 entries, 0 to 16434
Data columns (total 13 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   stop_id              16435 non-null  object
 1   stop_name            15936 non-null  object
 2   stop_lat             16435 non-null  float64
 3   stop_lon             16435 non-null  float64
 4   zone_id              15408 non-null  object
 5   stop_url             0 non-null      float64
 6   location_type        16435 non-null  int64
 7   parent_station       954 non-null    object
 8   wheelchair_boarding  16435 non-null  int64
 9   level_id             954 non-null    object
 10  platform_code        14750 non-null  object
 11  asw_node_id          15748 non-null  float64
 12  asw_stop_id          15354 non-null  float64
dtypes: float64(5), int64(2), object(6)
memory usage: 1.6+ MB
import unicodedata
def remove_diacritics(x):
    '''
    Taken from here: https://stackoverflow.com/questions/49891778/conversion-utf-to-ascii-in-python-with-pandas-dataframe
    '''
    if pd.isna(x):  # covers np.nan, None and pd.NA
        return x
    return unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode()
d.head()
| | stop_from | stop_from_name | stop_to | stop_to_name | depart_from | arrive_to | route_type | is_night | monday | tuesday | wednesday | thursday | friday | saturday | sunday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | U2991Z301 | Hněvice | T58005 | Hněvice seř.n. | 4:53:00 | 4:54:30 | 2 | False | True | True | True | True | True | False | False |
| 1 | T58005 | Hněvice seř.n. | U4610Z301 | Záluží | 4:54:30 | 4:56:00 | 2 | False | True | True | True | True | True | False | False |
| 2 | U4610Z301 | Záluží | U4609Z301 | Dobříň | 4:56:00 | 4:59:00 | 2 | False | True | True | True | True | True | False | False |
| 3 | U4609Z301 | Dobříň | U4608Z301 | Roudnice nad Labem | 4:59:00 | 5:03:00 | 2 | False | True | True | True | True | True | False | False |
| 4 | U4608Z301 | Roudnice nad Labem | U4609Z301 | Dobříň | 4:36:00 | 4:38:00 | 2 | False | True | True | True | True | True | False | False |
We will transform the "depart_from" and "arrive_to" features into datetime, because they will be easier to work with that way.
def wrap_time(s):
    # wrap hours >= 24 back into 0-23 (GTFS allows times past midnight)
    h, rest = s.split(':', 1)
    return f"{int(h) % 24}:{rest}"

d['depart_from'] = pd.to_datetime(d['depart_from'].apply(wrap_time), format='%H:%M:%S')
d['arrive_to'] = pd.to_datetime(d['arrive_to'].apply(wrap_time), format='%H:%M:%S')
d.head()
| | stop_from | stop_from_name | stop_to | stop_to_name | depart_from | arrive_to | route_type | is_night | monday | tuesday | wednesday | thursday | friday | saturday | sunday |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | U2991Z301 | Hněvice | T58005 | Hněvice seř.n. | 1900-01-01 04:53:00 | 1900-01-01 04:54:30 | 2 | False | True | True | True | True | True | False | False |
| 1 | T58005 | Hněvice seř.n. | U4610Z301 | Záluží | 1900-01-01 04:54:30 | 1900-01-01 04:56:00 | 2 | False | True | True | True | True | True | False | False |
| 2 | U4610Z301 | Záluží | U4609Z301 | Dobříň | 1900-01-01 04:56:00 | 1900-01-01 04:59:00 | 2 | False | True | True | True | True | True | False | False |
| 3 | U4609Z301 | Dobříň | U4608Z301 | Roudnice nad Labem | 1900-01-01 04:59:00 | 1900-01-01 05:03:00 | 2 | False | True | True | True | True | True | False | False |
| 4 | U4608Z301 | Roudnice nad Labem | U4609Z301 | Dobříň | 1900-01-01 04:36:00 | 1900-01-01 04:38:00 | 2 | False | True | True | True | True | True | False | False |
Now we will calculate the travel duration from one stop to the next; it can later be turned into a meaningful edge weight.
We will also compute the number of days each connection is active, for further computation.
Finally, we will remove diacritics from stop names to unify their spelling and simplify the code, and drop unnecessary columns.
# travel time between consecutive stops; add a day for segments crossing midnight
ser = d['arrive_to'] - d['depart_from']
ser[d['arrive_to'] < d['depart_from']] += pd.Timedelta("1 day")
d['route_duration'] = ser
d['days_active'] = d[['monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday']].sum(axis=1)
d['stop_from_name'] = d['stop_from_name'].apply(remove_diacritics).str.lower()
d['stop_to_name'] = d['stop_to_name'].apply(remove_diacritics).str.lower()
d = d.drop(columns=['stop_to', 'stop_from'])
d.head(3)
| | stop_from_name | stop_to_name | depart_from | arrive_to | route_type | is_night | monday | tuesday | wednesday | thursday | friday | saturday | sunday | route_duration | days_active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | hnevice | hnevice ser.n. | 1900-01-01 04:53:00 | 1900-01-01 04:54:30 | 2 | False | True | True | True | True | True | False | False | 0 days 00:01:30 | 5 |
| 1 | hnevice ser.n. | zaluzi | 1900-01-01 04:54:30 | 1900-01-01 04:56:00 | 2 | False | True | True | True | True | True | False | False | 0 days 00:01:30 | 5 |
| 2 | zaluzi | dobrin | 1900-01-01 04:56:00 | 1900-01-01 04:59:00 | 2 | False | True | True | True | True | True | False | False | 0 days 00:03:00 | 5 |
I was able to find a description of route_type in the GTFS documentation, so let's convert this column to a categorical type.
# GTFS route_type codes 0-7, named per the GTFS reference
route_type = pd.CategoricalDtype([
    'Tram', 'Subway', 'Rail', 'Bus', 'Ferry', 'Cable Tram', 'Aerial Lift', 'Funicular'
])
d['route_type'] = pd.Categorical.from_codes(d['route_type'], dtype=route_type)
display(np.sort(d['route_type'].unique()))
d.head()
array(['Bus', 'Ferry', 'Funicular', 'Rail', 'Subway', 'Tram'],
dtype=object)
| | stop_from_name | stop_to_name | depart_from | arrive_to | route_type | is_night | monday | tuesday | wednesday | thursday | friday | saturday | sunday | route_duration | days_active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | hnevice | hnevice ser.n. | 1900-01-01 04:53:00 | 1900-01-01 04:54:30 | Rail | False | True | True | True | True | True | False | False | 0 days 00:01:30 | 5 |
| 1 | hnevice ser.n. | zaluzi | 1900-01-01 04:54:30 | 1900-01-01 04:56:00 | Rail | False | True | True | True | True | True | False | False | 0 days 00:01:30 | 5 |
| 2 | zaluzi | dobrin | 1900-01-01 04:56:00 | 1900-01-01 04:59:00 | Rail | False | True | True | True | True | True | False | False | 0 days 00:03:00 | 5 |
| 3 | dobrin | roudnice nad labem | 1900-01-01 04:59:00 | 1900-01-01 05:03:00 | Rail | False | True | True | True | True | True | False | False | 0 days 00:04:00 | 5 |
| 4 | roudnice nad labem | dobrin | 1900-01-01 04:36:00 | 1900-01-01 04:38:00 | Rail | False | True | True | True | True | True | False | False | 0 days 00:02:00 | 5 |
Stops dataset¶
stops.head()
| | stop_id | stop_name | stop_lat | stop_lon | zone_id | stop_url | location_type | parent_station | wheelchair_boarding | level_id | platform_code | asw_node_id | asw_stop_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | U50S1 | Budějovická | 50.044411 | 14.448787 | P | NaN | 1 | NaN | 1 | NaN | NaN | 50.0 | NaN |
| 1 | U52S1 | Chodov | 50.031672 | 14.490961 | P | NaN | 1 | NaN | 1 | NaN | NaN | 52.0 | NaN |
| 2 | U75S1 | Kolbenova | 50.110395 | 14.516398 | P | NaN | 1 | NaN | 1 | NaN | NaN | 75.0 | NaN |
| 3 | U78S1 | Ládví | 50.126591 | 14.469451 | P | NaN | 1 | NaN | 1 | NaN | NaN | 78.0 | NaN |
| 4 | U100S1 | Vltavská | 50.100298 | 14.438492 | P | NaN | 1 | NaN | 1 | NaN | NaN | 100.0 | NaN |
We will unify zones P, B and 0, remove diacritics from stop names, and lowercase them.
stops['zone_id'] = stops['zone_id'].str.extract(r'(.)$', expand=False)\
.replace({'P' : '0', 'B' : '0'})\
.str.extract(r'(\d+)', expand=False)\
.astype('UInt8')
stops['stop_name_diacritics'] = stops['stop_name']
stops['stop_name'] = stops['stop_name'].apply(remove_diacritics).str.lower()
As the next step, we will aggregate all platforms into stations: we take the mean of the platform coordinates and, if a station has stops in multiple zones, the mode of the zones. We also aggregate wheelchair accessibility.
stations = stops.groupby('stop_name').agg({
'stop_lat' : 'mean',
'stop_lon' : 'mean',
'zone_id' : lambda x: stats.mode(x)[0],
'stop_name_diacritics' : lambda x: list(x)[0],
'wheelchair_boarding' : 'any'
}).reset_index()
stations.head()
| | stop_name | stop_lat | stop_lon | zone_id | stop_name_diacritics | wheelchair_boarding |
|---|---|---|---|---|---|---|
| 0 | ahr km 11,485 | 50.146550 | 14.731470 | <NA> | AHr Km 11,485 | False |
| 1 | albertov | 50.067917 | 14.420798 | 0 | Albertov | True |
| 2 | ametystova | 49.988201 | 14.362216 | 0 | Ametystová | False |
| 3 | amforova | 50.041778 | 14.327298 | 0 | Amforová | False |
| 4 | andel | 50.071132 | 14.403406 | 0 | Anděl | True |
stations.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7542 entries, 0 to 7541
Data columns (total 6 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   stop_name             7542 non-null   object
 1   stop_lat              7542 non-null   float64
 2   stop_lon              7542 non-null   float64
 3   zone_id               6835 non-null   UInt8
 4   stop_name_diacritics  7542 non-null   object
 5   wheelchair_boarding   7542 non-null   bool
dtypes: UInt8(1), bool(1), float64(2), object(2)
memory usage: 257.9+ KB
routes = d[(d['route_type'] == 'Tram') & ~d['is_night']]
print(f"Number of samples {len(routes)}")
routes.head()
Number of samples 557785
| | stop_from_name | stop_to_name | depart_from | arrive_to | route_type | is_night | monday | tuesday | wednesday | thursday | friday | saturday | sunday | route_duration | days_active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35575 | vozovna motol | motol | 1900-01-01 06:07:00 | 1900-01-01 06:08:00 | Tram | False | True | True | True | True | True | False | False | 0 days 00:01:00 | 5 |
| 35576 | motol | krematorium motol | 1900-01-01 06:08:00 | 1900-01-01 06:09:00 | Tram | False | True | True | True | True | True | False | False | 0 days 00:01:00 | 5 |
| 35577 | krematorium motol | hlusickova | 1900-01-01 06:09:00 | 1900-01-01 06:10:00 | Tram | False | True | True | True | True | True | False | False | 0 days 00:01:00 | 5 |
| 35578 | hlusickova | slanska | 1900-01-01 06:10:00 | 1900-01-01 06:12:00 | Tram | False | True | True | True | True | True | False | False | 0 days 00:02:00 | 5 |
| 35579 | slanska | blatiny | 1900-01-01 06:12:00 | 1900-01-01 06:13:00 | Tram | False | True | True | True | True | True | False | False | 0 days 00:01:00 | 5 |
Now we will aggregate over "stop_from_name" and "stop_to_name" to collect the individual edges, and then transform the resulting dataframe into an edge list.
edges = routes.groupby(['stop_from_name', 'stop_to_name',], as_index=False).size().drop(columns=['size'])
print(f"Number of edges: {edges.shape[0]}")
edges[:5]
Number of edges: 657
| | stop_from_name | stop_to_name |
|---|---|---|
| 0 | albertov | botanicka zahrada |
| 1 | albertov | ostrcilovo namesti |
| 2 | albertov | vyton |
| 3 | andel | arbesovo namesti |
| 4 | andel | bertramka |
Now we will select only the stations that trams stop at; having those stops extracted will make the visualization easier.
tram_stations_set = set(routes['stop_from_name']) | set(routes['stop_to_name'])
tram_stations = stations[stations['stop_name'].isin(tram_stations_set)]
print(f"Number of nodes: {tram_stations.shape[0]}")
tram_stations.head()
Number of nodes: 281
| | stop_name | stop_lat | stop_lon | zone_id | stop_name_diacritics | wheelchair_boarding |
|---|---|---|---|---|---|---|
| 1 | albertov | 50.067917 | 14.420798 | 0 | Albertov | True |
| 4 | andel | 50.071132 | 14.403406 | 0 | Anděl | True |
| 7 | arbesovo namesti | 50.075978 | 14.404773 | 0 | Arbesovo náměstí | True |
| 9 | arena liben jih | 50.102892 | 14.494592 | 0 | Arena Libeň jih | True |
| 42 | balabenka | 50.104022 | 14.482587 | 0 | Balabenka | True |
Plotting network¶
We will use plotly for plotting: with almost 300 nodes, no static image could convey much information about this network, while plotly provides interactive plots. Now we will collect the coordinates for the go.Scattergeo object.
edge_lon, edge_lat = [], []
def update_edges_coords(x, coord_lists=(edge_lon, edge_lat)):
    # append the (from, to, None) coordinates of one edge;
    # the None separates consecutive edges in plotly line traces
    lon, lat = coord_lists
    stop_from = tram_stations[tram_stations['stop_name'] == x['stop_from_name']].iloc[0]
    stop_to = tram_stations[tram_stations['stop_name'] == x['stop_to_name']].iloc[0]
    lat += [stop_from['stop_lat'], stop_to['stop_lat'], None]
    lon += [stop_from['stop_lon'], stop_to['stop_lon'], None]
edges.apply(update_edges_coords, axis=1)
edge_lat[:6]
[50.067917, 50.070775999999995, None, 50.067917, 50.0652065, None]
We will also display the Vltava river to make it easier to orient in the plot. It is approximated by a polyline of 21 points that we have manually chosen on a map.
vltava_lat = [ 50.14248, 50.1365, 50.12886, 50.11769, 50.11295, 50.11438, 50.11141, 50.11376, 50.10585, 50.09891, 50.09429, 50.09269, 50.08641, 50.07576, 50.0617, 50.05501, 50.02558, 49.99921, 49.99193, 49.98531, 49.97197 ]
vltava_lon = [14.39662, 14.39214, 14.40063, 14.39651, 14.40767, 14.41857, 14.42964, 14.44538, 14.46029, 14.45462, 14.42458, 14.416, 14.41102, 14.41019, 14.41487, 14.41506, 14.39678, 14.40228, 14.39077, 14.37155, 14.34549 ]
from plotly.graph_objs.scattergeo import Marker, Line
fig = go.Figure()
fig.add_trace(
go.Scattergeo(
name='Vltava',
lat=vltava_lat,
lon=vltava_lon,
mode='lines',
hoverinfo='skip',
line=Line(
color='#7ACCE1',
width=8
)
)
)
fig.add_trace(
go.Scattergeo(
name='Tram routes',
lat=edge_lat,
lon=edge_lon,
mode='lines',
hoverinfo='skip',
line=Line(
color='#65151D'
)
)
)
fig.add_trace(
go.Scattergeo(
name='Stations',
lat=tram_stations['stop_lat'].values,
lon=tram_stations['stop_lon'].values,
text=tram_stations['stop_name_diacritics'].values,
mode='markers',
hoverinfo='text',
marker=Marker(
color='#e6e3e3',
size=8,
line_width=2,
)
)
)
fig.update_geos(
fitbounds='locations',
scope='europe',
visible=False
)
fig.update_layout(
height=1200,
width=1200,
margin={"r":0,"t":0,"l":0,"b":0},
legend=dict(
yanchor="top",
y=0.96,
xanchor="left",
x=0.01
)
)
fig.show()
MG = nx.MultiDiGraph()
MG.add_edges_from(d[['stop_from_name', 'stop_to_name']].values)
print(f"Multigraph has {len(MG.nodes())} nodes")
print(f"Multigraph has {len(MG.edges())} edges")
Multigraph has 7391 nodes
Multigraph has 1642433 edges
G = nx.DiGraph()
G.add_edges_from(
d.groupby(['stop_from_name', 'stop_to_name']).size().reset_index().drop(columns=[0]).values
)
print(f"Graph has {len(G.nodes())} nodes")
print(f"Graph has {len(G.edges())} edges")
Graph has 7391 nodes
Graph has 18465 edges
Calculating centralities¶
Degree centrality¶
Degree centrality assigns each node an importance based on its degree. It makes sense to calculate it on our multigraph: many incoming and outgoing edges of a node correspond to high service frequency, so busier stations receive a higher score.
NetworkX normalizes every degree by $|V(G)| - 1$, where $|V(G)|$ is the number of vertices in the graph; we will convert the scores back to raw counts.
deg_cent = pd.DataFrame.from_dict(
nx.degree_centrality(MG), orient='index'
).rename(columns={0 : 'score'}).sort_values('score', ascending=False) * (len(MG) - 1.0)
deg_cent['score'] = deg_cent['score'].astype(np.uint32)
display(deg_cent.describe().T)
print("Top 10 stations with the most neighboring connections")
deg_cent.head(10).T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| score | 7391.0 | 444.441348 | 1217.950157 | 1.0 | 40.0 | 84.0 | 245.5 | 17206.0 |
Top 10 stations with the most neighboring connections
| | andel | karlovo namesti | i. p. pavlova | palmovka | lihovar | malostranska | narodni trida | narodni divadlo | ohrada | zelivskeho |
|---|---|---|---|---|---|---|---|---|---|---|
| score | 17206 | 16043 | 12613 | 12515 | 12256 | 11889 | 11698 | 11488 | 11451 | 11274 |
We can see that the "most central" stops are so-called transport hubs, where many means of transport meet. They also happen to be very busy.
Betweenness centrality¶
Betweenness centrality counts the number of shortest paths passing through a given node. The assumption under which it works is that the graph has a "radial" shape, so that it is faster to go through the centre to reach other nodes.
We need to compute the shortest paths from every node to every other node reachable from it, which is done by running a BFS from every node, giving $O(n^2 + nm)$ complexity, where $n$ is the number of nodes and $m$ the number of edges. That is at least $7391^2 + 7391 \cdot 18465 \approx 1.9 \cdot 10^8$ operations, and this estimate ignores our choice of language (Python), so the real cost is much higher. In short, the exact computation would take forever, so we approximate it using only $k$ randomly chosen nodes as BFS sources.
btwn_cent = pd.DataFrame.from_dict(
nx.betweenness_centrality(G, k=1000, seed=42), orient='index'
).rename(columns={0 : 'score'}).sort_values('score', ascending=False)
display(btwn_cent.describe().T)
print("Top 10 stations with the highest betweenness centrality score")
btwn_cent.head(10).T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| score | 7391.0 | 0.002611 | 0.006875 | 0.0 | 0.00019 | 0.000824 | 0.002396 | 0.16603 |
Top 10 stations with the highest betweenness centrality score
| | roztyly | lihovar | cerny most | chrastany | ostredek | zlicin | nemocnice krc | lhotka | hlavni nadrazi | i. p. pavlova |
|---|---|---|---|---|---|---|---|---|---|---|
| score | 0.16603 | 0.136086 | 0.12791 | 0.119674 | 0.106884 | 0.106022 | 0.099827 | 0.098467 | 0.089731 | 0.082423 |
This metric puts more emphasis on the "easy to travel through" aspect of centrality. Because we include all stations that PID operates, it favours those with more connections to the outer parts of the network (higher values of zone_id).
Eigenvector centrality¶
Eigenvector centrality estimates the importance of a node through the recursive assumption that important nodes link to other important nodes. This leads to a system of equations that can be written as a matrix-vector equation whose solution is an eigenvector of the adjacency matrix; it can be found iteratively with the power method.
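A toy illustration of the power method in plain NumPy (not the actual NetworkX implementation): for the adjacency matrix of a triangle, every node ends up equally central.

```python
import numpy as np

# adjacency matrix of a triangle (3-cycle)
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)

x = np.ones(3)
for _ in range(100):
    x = A @ x                # one power-iteration step
    x /= np.linalg.norm(x)   # renormalize to prevent overflow

print(np.round(x, 5))  # all entries equal 1/sqrt(3)
```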
ev_cent = pd.DataFrame.from_dict(
nx.eigenvector_centrality(G, max_iter=1000), orient='index'
).rename(columns={0 : 'score'}).sort_values('score', ascending=False)
display(ev_cent.describe().T)
print("Top 10 stations with the highest eigenvector centrality score")
ev_cent.head(10).T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| score | 7391.0 | 0.001331 | 0.011556 | 3.174914e-99 | 2.463414e-09 | 4.606858e-07 | 0.000017 | 0.363673 |
Top 10 stations with the highest eigenvector centrality score
| | cerny most | turnov,terminal u zel.st. | mlada boleslav,aut.st. | mlada boleslav,boleslavska | mlada boleslav,zaluzany skoda | mlada boleslav,jicinska | mnichovo hradiste,zdrav.str. | mlada boleslav,novy hrbitov | svijany,na najezdu | kosmonosy,transcentrum |
|---|---|---|---|---|---|---|---|---|---|---|
| score | 0.363673 | 0.31347 | 0.304207 | 0.289093 | 0.213959 | 0.186643 | 0.184919 | 0.169564 | 0.155258 | 0.151255 |
This metric yields a relative score and ranks higher the stations with a bridge-like quality. There is a large bus terminal at "Černý most", and several other stations in the top ten are bus terminals of various kinds as well.
Plotting degree centrality over the network¶
We will plot the degree-centrality scores on top of the tram network. Since the scores were computed on the full network, this is not a perfect match, but it is close to the best we can do here.
tram_stations_score = pd.merge(deg_cent.reset_index(), tram_stations, left_on='index', right_on='stop_name')
tram_stations_score.head()
| | index | score | stop_name | stop_lat | stop_lon | zone_id | stop_name_diacritics | wheelchair_boarding |
|---|---|---|---|---|---|---|---|---|
| 0 | andel | 17206 | andel | 50.071132 | 14.403406 | 0 | Anděl | True |
| 1 | karlovo namesti | 16043 | karlovo namesti | 50.075203 | 14.418322 | 0 | Karlovo náměstí | True |
| 2 | i. p. pavlova | 12613 | i. p. pavlova | 50.075074 | 14.430867 | 0 | I. P. Pavlova | True |
| 3 | palmovka | 12515 | palmovka | 50.103996 | 14.474715 | 0 | Palmovka | True |
| 4 | lihovar | 12256 | lihovar | 50.050508 | 14.409909 | 0 | Lihovar | True |
from plotly.graph_objs.scattergeo import Marker, Line
fig = go.Figure()
fig.add_trace(
go.Scattergeo(
name='Vltava',
lat=vltava_lat,
lon=vltava_lon,
mode='lines',
hoverinfo='skip',
line=Line(
color='#7ACCE1',
width=8
)
)
)
fig.add_trace(
go.Scattergeo(
name='Tram routes',
lat=edge_lat,
lon=edge_lon,
mode='lines',
hoverinfo='skip',
line=Line(
color='#65151D'
)
)
)
fig.add_trace(
go.Scattergeo(
name='Stations',
lat=tram_stations_score['stop_lat'],
lon=tram_stations_score['stop_lon'],
text=tram_stations_score['stop_name_diacritics'] + ' (' + tram_stations_score['score'].astype(str) + ')',
mode='markers',
hoverinfo='text',
marker=Marker(
color=tram_stations_score['score'],
colorscale='thermal',
size=12,
line_width=1,
colorbar=dict(
title="Number of connections coming and leaving station",
len=0.5
)
)
)
)
fig.update_geos(
fitbounds='locations',
scope='europe',
visible=False
)
fig.update_layout(
height=1200,
width=1200,
margin={"r":0,"t":0,"l":0,"b":0},
legend=dict(
yanchor="top",
y=0.96,
xanchor="left",
x=0.01
)
)
fig.show()
Own questions¶
Betweenness centrality for weekend routes between 5:00 and 6:00¶
There is less traffic on weekends and in the early morning, although PID still needs to cover most of the stations. Hypothetically, this leads to a less centralized network, so we will compare the betweenness centrality of the whole network with that of the weekend early-morning network.
# weekend services whose whole stop-to-stop segment lies between 5:00 and 6:00
d_wm = d[
    d['saturday'] &
    d['sunday'] &
    (pd.to_datetime('05:00:00', format='%H:%M:%S') <= d['depart_from']) &
    (d['arrive_to'] <= pd.to_datetime('06:00:00', format='%H:%M:%S'))
]
d_wm.head()
| | stop_from_name | stop_to_name | depart_from | arrive_to | route_type | is_night | monday | tuesday | wednesday | thursday | friday | saturday | sunday | route_duration | days_active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | hnevice | hnevice ser.n. | 1900-01-01 05:53:00 | 1900-01-01 05:54:30 | Rail | False | True | True | True | True | True | True | True | 0 days 00:01:30 | 7 |
| 9 | hnevice ser.n. | zaluzi | 1900-01-01 05:54:30 | 1900-01-01 05:56:00 | Rail | False | True | True | True | True | True | True | True | 0 days 00:01:30 | 7 |
| 10 | zaluzi | dobrin | 1900-01-01 05:56:00 | 1900-01-01 05:59:00 | Rail | False | True | True | True | True | True | True | True | 0 days 00:03:00 | 7 |
| 15 | hnevice ser.n. | hnevice | 1900-01-01 05:02:00 | 1900-01-01 05:03:00 | Rail | False | False | False | False | False | False | True | True | 0 days 00:01:00 | 2 |
| 28 | roudnice nad labem | dobrin | 1900-01-01 05:54:00 | 1900-01-01 05:56:30 | Rail | False | True | True | True | True | True | True | True | 0 days 00:02:30 | 7 |
MG_wm = nx.MultiDiGraph()
MG_wm.add_edges_from(d_wm[['stop_from_name', 'stop_to_name']].values)
print(f"Multigraph has {len(MG_wm.nodes())} nodes")
print(f"Multigraph has {len(MG_wm.edges())} edges")
Multigraph has 2812 nodes
Multigraph has 16474 edges
btwn_cent_wm = pd.DataFrame.from_dict(
nx.betweenness_centrality(MG_wm, k=1000, seed=42), orient='index'
).rename(columns={0 : 'score'}).sort_values('score', ascending=False)
print('#' * 10, ' WEEKEND MORNING ', '#' * 10)
display(btwn_cent_wm.describe().T)
print("Top 10 stations with the highest betweenness centrality score of the weekend early-morning network.")
display(btwn_cent_wm.head(10).T)
print('#' * 10, ' ALLWEEK ', '#' * 10)
display(btwn_cent.describe().T)
print("Top 10 stations with the highest betweenness centrality score of the all-week network")
display(btwn_cent.head(10).T)
########## WEEKEND MORNING ##########
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| score | 2812.0 | 0.005246 | 0.013047 | 0.0 | 0.000134 | 0.001368 | 0.004378 | 0.159034 |
Top 10 stations with the highest betweenness centrality score in the weekend early-morning network.
| | zlicin | florenc | hlavni nadrazi | u hangaru | lihovar | k letisti | smichovske nadrazi | terminal 1 | muzeum | vltavska |
|---|---|---|---|---|---|---|---|---|---|---|
| score | 0.159034 | 0.132443 | 0.124987 | 0.124706 | 0.121288 | 0.112645 | 0.111882 | 0.097551 | 0.096744 | 0.096244 |
########## ALLWEEK ##########
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| score | 7391.0 | 0.002611 | 0.006875 | 0.0 | 0.00019 | 0.000824 | 0.002396 | 0.16603 |
Top 10 stations with the highest betweenness centrality score in the all-week network.
| | roztyly | lihovar | cerny most | chrastany | ostredek | zlicin | nemocnice krc | lhotka | hlavni nadrazi | i. p. pavlova |
|---|---|---|---|---|---|---|---|---|---|---|
| score | 0.16603 | 0.136086 | 0.12791 | 0.119674 | 0.106884 | 0.106022 | 0.099827 | 0.098467 | 0.089731 | 0.082423 |
Can a single mode of transportation cover all stops in the city of Prague (zone 0)?¶
We could have done this by selecting only the edges whose route type matches the chosen mode of transport and then calling nx.is_connected(), but we will take a different approach.
We will compute the total number of stops in Prague and compare it to the number of stations reachable by the chosen mode of transport from "Hlavní nádraží". If any mode of transport covered the whole network, the number of reachable stations would equal the number of stations in Prague.
This approach also gives us a measure of coverage density for each mode of transportation.
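The reachability idea can be sketched on a toy graph (hypothetical stops, not PID data): the size of the connected component around a hub tells us how many stations a mode reaches, not just whether its network is connected.

```python
import networkx as nx

# Two disconnected "islands" of stops served by the same mode.
G = nx.Graph()
G.add_edges_from([("hub", "a"), ("a", "b"), ("b", "hub"),  # island 1
                  ("c", "d")])                             # island 2

# All nodes in the same component as the hub are reachable from it.
reachable = nx.node_connected_component(G, "hub")
print(len(reachable), "of", G.number_of_nodes(), "stations reachable")  # 3 of 5
```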
prague_stations = stations[stations['zone_id'] == 0]
prg_stations_set = set(prague_stations['stop_name'])
N = prague_stations.shape[0]
print(f"Number of stations in Prague: {N}")
prague_stations.head(2)
Number of stations in Prague: 1594
| | stop_name | stop_lat | stop_lon | zone_id | stop_name_diacritics | wheelchair_boarding |
|---|---|---|---|---|---|---|
| 1 | albertov | 50.067917 | 14.420798 | 0 | Albertov | True |
| 2 | ametystova | 49.988201 | 14.362216 | 0 | Ametystová | False |
G_prague = nx.Graph()
G_prague.add_nodes_from(prague_stations['stop_name'].values)
d_prague = d[d['stop_from_name'].isin(prg_stations_set) & d['stop_to_name'].isin(prg_stations_set)]
covering_mean_of_trns = []
for mean_of_transport in d['route_type'].unique():
    mt_edges = d_prague.loc[d_prague['route_type'] == mean_of_transport, ['stop_from_name', 'stop_to_name']].values
    G_prague.add_edges_from(mt_edges)
    n_reachable = len(nx.node_connected_component(G_prague, 'hlavni nadrazi'))
    print(f"By {mean_of_transport:<10} it is possible to reach {n_reachable} of {N} stations in Prague")
    if n_reachable == N:
        covering_mean_of_trns.append(mean_of_transport)
    G_prague.clear_edges()
print()
if not covering_mean_of_trns:
    print("No mode of transportation covers the whole zero zone")
else:
    print(f"Zero zone can be covered by (separately): {covering_mean_of_trns}")
By Rail it is possible to reach 1 of 1594 stations in Prague
By Bus it is possible to reach 1276 of 1594 stations in Prague
By Tram it is possible to reach 281 of 1594 stations in Prague
By Ferry it is possible to reach 1 of 1594 stations in Prague
By Funicular it is possible to reach 1 of 1594 stations in Prague
By Subway it is possible to reach 58 of 1594 stations in Prague
No mode of transportation covers the whole zero zone
What are the longest shortest paths on trams, ignoring waiting time?¶
This is a real-world question. Every few months my family comes to Prague; my mother and I love the city and long walks around it, but my little brother gets tired very fast, so we often surrender to "tram-walks", travelling across Prague by tram. The longer the voyage, the more time my brother has to rest.
Anecdotes aside, the metric corresponding to the longest shortest path is the diameter of a graph: the greatest shortest-path distance between any pair of nodes, which NetworkX computes with nx.diameter(G). To retrieve the endpoints of such a path we will use nx.periphery, which returns the nodes whose eccentricity (the shortest-path distance to the farthest node) equals the diameter. Both functions generalize to weighted graphs via the weight argument.
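On a tiny weighted graph (made up for illustration) the relationship between eccentricity, diameter and periphery looks like this (the `weight=` keyword of these functions requires a reasonably recent NetworkX, 2.8+):

```python
import networkx as nx

# Three stops on a line: a --1-- b --10-- c (weights in seconds).
G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 1), ("b", "c", 10)])

ecc = nx.eccentricity(G, weight="weight")  # farthest-node distance per node
print(ecc)                                 # {'a': 11, 'b': 10, 'c': 11}
print(nx.diameter(G, weight="weight"))     # 11 -- the largest eccentricity
print(nx.periphery(G, weight="weight"))    # ['a', 'c'] -- nodes attaining it
```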
As edge weights we take the mean of all durations that the dataset $D$ records between each pair of stations. Several records have a duration of zero; we will not use those.
Because Prague trams run in both directions between every pair of adjacent stations, we represent the network as an undirected graph.
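One caveat of the undirected representation: the grouped data still contains both ordered pairs (A→B and B→A), and when both are added to an `nx.Graph`, the later entry silently overwrites the earlier edge's weight. A minimal sketch (station names taken from the data, the second duration is made up); for tram travel times the two directions are nearly identical, so we accept this simplification:

```python
import networkx as nx

# Adding both directions of the same undirected edge: the later weight wins.
G = nx.Graph()
G.add_weighted_edges_from([("andel", "arbesovo namesti", 86),
                           ("arbesovo namesti", "andel", 90)])
print(G["andel"]["arbesovo namesti"]["weight"])  # 90 -- 86 was overwritten
```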
tram_routes = d[d['route_type'] == 'Tram'].groupby(['stop_from_name', 'stop_to_name']).agg(
    {'route_duration': 'mean'}
).reset_index()
tram_routes['route_duration_seconds'] = tram_routes['route_duration'].dt.seconds
tram_edges = tram_routes[tram_routes['route_duration_seconds'] != 0].drop(columns=['route_duration'])
Here is an example of such an instant route.
tram_routes[tram_routes['route_duration'].dt.seconds == 0].head(2)
| | stop_from_name | stop_to_name | route_duration | route_duration_seconds |
|---|---|---|---|---|
| 237 | levskeho | sidliste modrany | 0 days | 0 |
| 318 | nademlejnska | u elektry | 0 days | 0 |
d.loc[d['stop_from_name'] == 'levskeho', ['stop_from_name', 'stop_to_name', 'depart_from', 'arrive_to']].tail(2)
| | stop_from_name | stop_to_name | depart_from | arrive_to |
|---|---|---|---|---|
| 1523365 | levskeho | sidliste modrany | 1900-01-01 04:16:00 | 1900-01-01 04:16:00 |
| 1523409 | levskeho | sidliste modrany | 1900-01-01 04:16:00 | 1900-01-01 04:16:00 |
Now we can construct the graph, calculate the diameter, and find the periphery nodes.
G_tram = nx.Graph()
G_tram.add_weighted_edges_from(tram_edges.values)
periphery_nodes = nx.periphery(G_tram, weight='weight')
diameter = nx.diameter(G_tram, weight='weight')
print(f"Diameter is: {diameter}")
periphery_nodes
Diameter is: 3553
['lehovec', 'sidliste modrany']
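A quick sanity check of the relationship we rely on next, on a toy weighted graph (illustrative only): the weighted shortest-path length between two periphery nodes equals the diameter, but only when `weight=` is also passed to the shortest-path call; without it NetworkX minimises the number of hops instead of travel time.

```python
import networkx as nx

G = nx.Graph()
G.add_weighted_edges_from([("a", "b", 1), ("b", "c", 10), ("a", "c", 20)])

diam = nx.diameter(G, weight="weight")
u, v = nx.periphery(G, weight="weight")[:2]

# Weighted length between periphery nodes matches the diameter...
assert nx.shortest_path_length(G, u, v, weight="weight") == diam
# ...while the unweighted (hop-count) length generally does not.
print(diam, nx.shortest_path_length(G, u, v))  # 11 1
```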
Let's find the weighted shortest path between the two periphery stations. Note the weight='weight' argument: without it, nx.shortest_path would minimise the number of stops rather than travel time.
path_nodes = nx.shortest_path(G_tram, periphery_nodes[0], periphery_nodes[1], weight='weight')
path_nodes_set = set(path_nodes)
We will distinguish the path by color, so we need to add flags to the datasets that we will later turn into a network Plotly can visualise.
tram_stations['is_path'] = tram_stations['stop_name'].isin(path_nodes_set)
tram_edges['is_path'] = (tram_edges['stop_from_name'].isin(path_nodes_set)
                         & tram_edges['stop_to_name'].isin(path_nodes_set))
tram_edges.head()
tram_edges.head()
| | stop_from_name | stop_to_name | route_duration_seconds | is_path |
|---|---|---|---|---|
| 0 | albertov | botanicka zahrada | 60 | False |
| 1 | albertov | ostrcilovo namesti | 60 | False |
| 2 | albertov | vyton | 107 | False |
| 3 | andel | arbesovo namesti | 86 | False |
| 4 | andel | bertramka | 127 | False |
Finally, we can convert everything into lists of coordinates suitable for Plotly. A None entry after each edge segment tells Plotly to lift the pen, so unconnected edges are not joined by spurious lines.
tram_edge_lat = []
tram_edge_lon = []
path_edge_lat = []
path_edge_lon = []
def update_tram_edges_coords(x):
    global tram_edge_lat, tram_edge_lon, path_edge_lat, path_edge_lon
    from_edge = tram_stations[tram_stations['stop_name'] == x['stop_from_name']].iloc[0]
    to_edge = tram_stations[tram_stations['stop_name'] == x['stop_to_name']].iloc[0]
    if x['is_path']:
        path_edge_lat += [from_edge['stop_lat'], to_edge['stop_lat'], None]
        path_edge_lon += [from_edge['stop_lon'], to_edge['stop_lon'], None]
    else:
        tram_edge_lat += [from_edge['stop_lat'], to_edge['stop_lat'], None]
        tram_edge_lon += [from_edge['stop_lon'], to_edge['stop_lon'], None]
tram_edges.apply(update_tram_edges_coords, axis=1)
path_edge_lat[:6]
[50.102892, 50.1020335, None, 50.102892, 50.103544, None]
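For larger networks a row-wise apply can be slow; the same None-separated coordinate lists can be built with two vectorised joins instead. A self-contained sketch with toy stand-ins for tram_stations and tram_edges (the coordinates are made up):

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical coordinates).
stations = pd.DataFrame({'stop_name': ['a', 'b', 'c'],
                         'stop_lat': [50.0, 50.1, 50.2],
                         'stop_lon': [14.3, 14.4, 14.5]})
edges = pd.DataFrame({'stop_from_name': ['a', 'b'],
                      'stop_to_name': ['b', 'c']})

# Attach both endpoints' coordinates in two vectorised joins.
coords = stations.set_index('stop_name')
e = (edges.join(coords.add_prefix('from_'), on='stop_from_name')
          .join(coords.add_prefix('to_'), on='stop_to_name'))

# Interleave (from, to, None) per edge -- the None lifts Plotly's pen.
lat = [v for t in zip(e['from_stop_lat'], e['to_stop_lat']) for v in (*t, None)]
lon = [v for t in zip(e['from_stop_lon'], e['to_stop_lon']) for v in (*t, None)]
print(lat)  # [50.0, 50.1, None, 50.1, 50.2, None]
```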
We will use similar code to what we used for the simple network, except that the path is plotted separately.
fig = go.Figure()
fig.add_trace(
    go.Scattergeo(
        name='Vltava',
        lat=vltava_lat,
        lon=vltava_lon,
        mode='lines',
        hoverinfo='skip',
        line=dict(color='#7ACCE1', width=8)
    )
)
fig.add_trace(
    go.Scattergeo(
        name='Tram routes',
        lat=tram_edge_lat,
        lon=tram_edge_lon,
        mode='lines+markers',
        hoverinfo='skip',
        line=dict(color='#ccbaba')
    )
)
fig.add_trace(
    go.Scattergeo(
        name=f'Path. Total time: {diameter // 60}m {diameter % 60}s',
        lat=path_edge_lat,
        lon=path_edge_lon,
        mode='lines+markers',
        hoverinfo='skip',
        line=dict(color='#F5B700', width=8),
        marker=dict(color='#F5B700', size=8, line_width=2)
    )
)
tram_stations_path = tram_stations[tram_stations['is_path']]
fig.add_trace(
    go.Scattergeo(
        lat=tram_stations_path['stop_lat'].values,
        lon=tram_stations_path['stop_lon'].values,
        text=tram_stations_path['stop_name_diacritics'].values,
        mode='markers',
        hoverinfo='text',
        showlegend=False,
        marker=dict(color='#F5B700', size=8, line_width=2)
    )
)
fig.update_geos(fitbounds='locations', scope='europe', visible=False)
fig.update_layout(
    height=1200,
    width=1200,
    margin={"r": 0, "t": 0, "l": 0, "b": 0},
    legend=dict(yanchor="top", y=0.96, xanchor="left", x=0.01)
)
fig.show()